In today’s data-driven landscape, understanding how factors relate to one another is vital for effective marketing strategy. This project builds a discrete Bayesian network to model the hypothesized causal relationships between key variables in an obesity dataset. By posing concrete probabilistic questions to the network, we look for insights that can inform marketing decisions and give businesses a competitive edge. Join us as we explore how probabilistic modeling can extract value from data assets.
To initiate our analysis, the first step involves loading the dataset into our R environment. The dataset, stored as “ObesityDataSet.csv”, contains information on obesity levels among individuals from Mexico, Peru and Colombia. We set the working directory to the dataset’s location and then load the dataset into R for inspection and preprocessing.
library(readr)
library(tidyverse)
library(lobstr)
library(dplyr)
library(ggplot2)
library(magrittr)
library(data.table)
library(caret)
library(MASS)
library(purrr)
library(GGally)
library(gplots)
library(pROC)
library(kernlab)
library(stats)
library(e1071)
library(igraph)
library(ggraph)
library(knitr)
library(kableExtra)
# Setting the working directory to where the dataset is located
# Uncomment the appropriate line below to match your directory structure
# setwd("C:/Users/dswag/Desktop/Zadania/OZNAL/Data")
setwd("C:/work/oznal/OZNAL_zadania/Data")
# Listing all files in the current working directory to verify the presence of our dataset
list.files()
## [1] "baysnetw.png" "ObesityDataSet.csv" "vypocet1"
## [4] "vypocet1.pdf" "vypocet2" "vypocet2.pdf"
## [7] "vypocet3" "vypocet3.pdf" "vypocet4"
## [10] "vypocet4.pdf" "vypocet5" "vypocet5.pdf"
# Reading the dataset into R
raw_data <- read_csv("ObesityDataSet.csv", col_names = TRUE, num_threads = 4)
# Displaying the first few rows of the dataset to ensure it's loaded correctly
head(raw_data)
## # A tibble: 6 × 17
## Gender Age Height Weight family_history_with_overw…¹ FAVC FCVC NCP CAEC
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 Female 21 1.62 64 yes no 2 3 Some…
## 2 Female 21 1.52 56 yes no 3 3 Some…
## 3 Male 23 1.8 77 yes no 2 3 Some…
## 4 Male 27 1.8 87 no no 3 3 Some…
## 5 Male 22 1.78 89.8 no no 2 1 Some…
## 6 Male 29 1.62 53 no yes 2 3 Some…
## # ℹ abbreviated name: ¹family_history_with_overweight
## # ℹ 8 more variables: SMOKE <chr>, CH2O <dbl>, SCC <chr>, FAF <dbl>, TUE <dbl>,
## # CALC <chr>, MTRANS <chr>, Nobesity <chr>
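Hard-coding `setwd()` ties the script to one machine. A minimal sketch of an alternative builds the path with `file.path()` and fails early if the file is missing. The helper `load_obesity_data()` and its `data_dir` argument are hypothetical names, not part of the original analysis. Base `read.csv()` is used so the sketch runs without extra packages; `read_csv()` could be swapped in.

```r
# Hypothetical helper: read the CSV from a configurable directory instead of
# changing the working directory with setwd().
load_obesity_data <- function(data_dir, file_name = "ObesityDataSet.csv") {
  path <- file.path(data_dir, file_name)
  if (!file.exists(path)) {
    stop("Dataset not found at: ", path)
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Demonstration on a tiny stand-in file written to a temporary directory
# (illustrative rows, not the real dataset).
demo_dir <- tempdir()
demo_rows <- data.frame(
  Gender = c("Female", "Male"),
  Age = c(21, 23),
  FAVC = c("no", "yes")
)
write.csv(demo_rows, file.path(demo_dir, "ObesityDataSet.csv"), row.names = FALSE)

demo_data <- load_obesity_data(demo_dir)
print(dim(demo_data))
```

Passing the directory as an argument keeps the path in one place and lets collaborators run the same script without editing it.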
# Renaming columns in the dataset to make them more descriptive
names(raw_data)[names(raw_data) == "FAVC"] <- "FreqConsHighCalFood"
names(raw_data)[names(raw_data) == "FCVC"] <- "FreqConsVegs"
names(raw_data)[names(raw_data) == "NCP"] <- "NumMainMeals"
names(raw_data)[names(raw_data) == "CAEC"] <- "ConsFoodBetwMeals"
names(raw_data)[names(raw_data) == "CH2O"] <- "ConsWaterDaily"
names(raw_data)[names(raw_data) == "CALC"] <- "ConsAlc"
names(raw_data)[names(raw_data) == "SCC"] <- "CalsConsMon"
names(raw_data)[names(raw_data) == "FAF"] <- "PhysActFreq"
names(raw_data)[names(raw_data) == "TUE"] <- "TimeTechDev"
names(raw_data)[names(raw_data) == "MTRANS"] <- "Trans"
names(raw_data)[names(raw_data) == "family_history_with_overweight"] <- "ObesityFam"
cleaned_data <- raw_data
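The eleven renaming statements above can also be driven by a single lookup vector. The following is a sketch only: `rename_columns()` and the toy data frame are hypothetical, while the name pairs are taken from the renaming step itself.

```r
# Lookup table: original column name -> descriptive name (taken from the
# renaming step above).
new_names <- c(
  FAVC = "FreqConsHighCalFood",
  FCVC = "FreqConsVegs",
  NCP = "NumMainMeals",
  CAEC = "ConsFoodBetwMeals",
  CH2O = "ConsWaterDaily",
  CALC = "ConsAlc",
  SCC = "CalsConsMon",
  FAF = "PhysActFreq",
  TUE = "TimeTechDev",
  MTRANS = "Trans",
  family_history_with_overweight = "ObesityFam"
)

# Hypothetical helper: rename only the columns that are present, leaving the
# rest untouched.
rename_columns <- function(df, lookup) {
  hits <- names(df) %in% names(lookup)
  names(df)[hits] <- unname(lookup[names(df)[hits]])
  df
}

# Toy data frame standing in for raw_data (illustrative values only).
toy <- data.frame(Gender = "Female", FAVC = "no", FCVC = 2, MTRANS = "Walking")
toy <- rename_columns(toy, new_names)
print(names(toy))
```

Keeping the mapping in one vector makes it easy to review and to reuse when documenting the columns.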
# Clean and standardize the PhysActFreq and FreqConsVegs columns by rounding the values to the nearest whole number
cleaned_data$PhysActFreq <- round(cleaned_data$PhysActFreq)
cleaned_data$PhysActFreq <- pmin(pmax(cleaned_data$PhysActFreq, 0), 3)
cleaned_data$FreqConsVegs <- round(cleaned_data$FreqConsVegs)
cleaned_data$FreqConsVegs <- pmin(pmax(cleaned_data$FreqConsVegs, 1), 3)
# Map the rounded PhysActFreq and FreqConsVegs codes to descriptive labels (Age stays numeric)
cleaned_data$PhysActFreq <- ifelse(cleaned_data$PhysActFreq == 0, 'No activity',
  ifelse(cleaned_data$PhysActFreq == 1, 'Low activity',
    ifelse(cleaned_data$PhysActFreq == 2, 'Moderate activity', 'High activity')))
cleaned_data$FreqConsVegs <- ifelse(cleaned_data$FreqConsVegs == 1, 'Low consumption',
  ifelse(cleaned_data$FreqConsVegs == 2, 'Moderate consumption', 'High consumption'))
# Convert every character column to a factor
cleaned_data <- cleaned_data %>%
  mutate(across(where(is.character), as.factor))
str(cleaned_data)
## tibble [2,111 × 17] (S3: tbl_df/tbl/data.frame)
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
## $ Age : num [1:2111] 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num [1:2111] 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num [1:2111] 64 56 77 87 89.8 53 55 53 64 68 ...
## $ ObesityFam : Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
## $ FreqConsHighCalFood: Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
## $ FreqConsVegs : Factor w/ 3 levels "High consumption",..: 3 1 3 1 3 3 1 3 1 3 ...
## $ NumMainMeals : num [1:2111] 3 3 3 3 1 3 3 3 3 3 ...
## $ ConsFoodBetwMeals : Factor w/ 4 levels "Always","Frequently",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ SMOKE : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ ConsWaterDaily : num [1:2111] 2 3 2 2 2 2 2 2 2 2 ...
## $ CalsConsMon : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ PhysActFreq : Factor w/ 4 levels "High activity",..: 4 1 3 3 4 4 2 1 2 2 ...
## $ TimeTechDev : num [1:2111] 1 0 1 0 0 0 0 0 1 1 ...
## $ ConsAlc : Factor w/ 4 levels "Always","Frequently",..: 3 4 2 2 4 4 4 4 2 3 ...
## $ Trans : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
## $ Nobesity : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 2 ...
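Because `as.factor()` sorts labels alphabetically, the activity levels above come out as "High activity", "Low activity", "Moderate activity", "No activity". A sketch of an alternative, assuming an intensity ordering is wanted (the original analysis does not require one), maps the rounded codes straight to an ordered factor with `factor()`. The helper name `label_activity()` is hypothetical.

```r
# Hypothetical helper: round, clamp to 0..3, and label activity codes in
# order of increasing intensity.
label_activity <- function(x) {
  codes <- pmin(pmax(round(x), 0), 3)
  factor(
    codes,
    levels = 0:3,
    labels = c("No activity", "Low activity", "Moderate activity", "High activity"),
    ordered = TRUE
  )
}

# Illustrative raw values, including fractional ones that need rounding.
raw_faf <- c(0, 0.4, 1.6, 2.2, 3)
activity <- label_activity(raw_faf)
print(table(activity))
```

With explicit levels, tables and plots list the categories from least to most active, and unused levels still appear with a count of zero.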
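The dataset contains the following columns:
- Gender: Female or Male
- Age: age in years
- Height: height in metres
- Weight: weight in kilograms
- ObesityFam: family history of overweight (yes/no)
- FreqConsHighCalFood: frequent consumption of high-caloric food (yes/no)
- FreqConsVegs: frequency of vegetable consumption
- NumMainMeals: number of main meals per day
- ConsFoodBetwMeals: consumption of food between meals
- SMOKE: whether the individual smokes (yes/no)
- ConsWaterDaily: daily water consumption
- CalsConsMon: whether the individual monitors calorie consumption (yes/no)
- PhysActFreq: physical activity frequency
- TimeTechDev: time spent using technology devices
- ConsAlc: alcohol consumption
- Trans: mode of transportation
- Nobesity: obesity level (seven categories)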
To construct the Bayesian network, we need to select variables that could logically influence each other. Based on the variables available, here’s a selection for the nodes in our network:
Here is the visual representation of the Bayesian network based on the relationships discussed above. The graph illustrates how Gender, form of transportation (Trans), family history of overweight (ObesityFam), frequent consumption of high-caloric food (FreqConsHighCalFood), frequency of vegetable consumption (FreqConsVegs), physical activity frequency (PhysActFreq) and calorie monitoring (CalsConsMon) are hypothesized to influence the obesity level (Nobesity).
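As an illustration of what such a network encodes, suppose every listed factor is a direct parent of Nobesity and no arcs connect the factors to each other. This structure is an assumption made for the example; the actual graph may contain additional arcs. Writing G for Gender, T for Trans, F for ObesityFam, H for FreqConsHighCalFood, V for FreqConsVegs, A for PhysActFreq, C for CalsConsMon and N for Nobesity, the joint distribution would factorize as:

```latex
P(G, T, F, H, V, A, C, N)
  = P(G)\,P(T)\,P(F)\,P(H)\,P(V)\,P(A)\,P(C)\,
    P(N \mid G, T, F, H, V, A, C)
```

Under this assumed structure the conditional probability table for N would have 2 · 5 · 2 · 2 · 3 · 4 · 2 = 960 parent configurations, so many of its rows would be estimated from very few of the 2,111 observations.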
What is the joint probability of being male, frequently consuming high caloric food, and being classified as Obesity Type II?
- Objective: Quantify the proportion of the population that is male, frequently consumes high-calorie food, and is classified as Obesity Type II.
- Application: Knowing this probability can help businesses and health organizations tailor nutrition or fitness programs to men who are most at risk because of their dietary habits. It is particularly useful for creating personalized diet plans or fitness regimens for individuals with a high caloric intake.

What is the probability distribution of obesity categories among individuals with a family history of overweight?
- Objective: Understand how the probability of each obesity category (Insufficient Weight, Normal Weight, Overweight Level I/II, Obesity Type I/II/III) is distributed among individuals whose family history suggests a predisposition to obesity.
- Application: The findings can guide health marketers and product developers who focus on weight management solutions, allowing marketing and product development strategies to take family history into account.
Answering these questions provides strategic insights for product development, marketing, public health initiatives and policy making. They describe specific subsets of the population, which enables tailored approaches that are likely to be more effective than broad strategies.
Let’s start by calculating the marginal probability distribution of each variable.
# Calculate probability distributions
prob_gender <- prop.table(table(cleaned_data$Gender))
prob_trans <- prop.table(table(cleaned_data$Trans))
prob_obesityfam <- prop.table(table(cleaned_data$ObesityFam))
prob_FreqConsHighCalFood <- prop.table(table(cleaned_data$FreqConsHighCalFood))
prob_FreqConsVegs <- prop.table(table(cleaned_data$FreqConsVegs))
prob_PhysActFreq <- prop.table(table(cleaned_data$PhysActFreq))
prob_CalsConsMon <- prop.table(table(cleaned_data$CalsConsMon))
prob_ConsAlc <- prop.table(table(cleaned_data$ConsAlc))
prob_nobesity <- prop.table(table(cleaned_data$Nobesity))
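The repeated `prop.table(table(...))` calls can be collected into one pass over all factor columns. This is a sketch under the assumption that every factor column is of interest; `marginal_distributions()` and the toy data frame are hypothetical.

```r
# Hypothetical helper: marginal distribution of every factor column.
marginal_distributions <- function(df) {
  factor_cols <- names(df)[vapply(df, is.factor, logical(1))]
  lapply(df[factor_cols], function(col) prop.table(table(col)))
}

# Toy stand-in for cleaned_data (illustrative values only).
toy <- data.frame(
  Gender = factor(c("Female", "Male", "Male", "Female", "Male")),
  SMOKE = factor(c("no", "no", "yes", "no", "no")),
  Age = c(21, 23, 27, 22, 29)
)

marginals <- marginal_distributions(toy)
print(marginals$Gender)

# Sanity check: each marginal distribution should sum to 1.
print(vapply(marginals, sum, numeric(1)))
```

The result is a named list, so `marginals$Gender` plays the role of `prob_gender` above, and numeric columns such as Age are skipped automatically.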
| Gender | Probability |
|---|---|
| Female | 0.4940786 |
| Male | 0.5059214 |

| Trans | Probability |
|---|---|
| Automobile | 0.2164851 |
| Bike | 0.0033160 |
| Motorbike | 0.0052108 |
| Public_Transportation | 0.7484604 |
| Walking | 0.0265277 |

| ObesityFam | Probability |
|---|---|
| no | 0.182378 |
| yes | 0.817622 |

| FreqConsHighCalFood | Probability |
|---|---|
| no | 0.1160587 |
| yes | 0.8839413 |

| FreqConsVegs | Probability |
|---|---|
| High consumption | 0.4718143 |
| Low consumption | 0.0483183 |
| Moderate consumption | 0.4798674 |

| PhysActFreq | Probability |
|---|---|
| High activity | 0.0563714 |
| Low activity | 0.3675983 |
| Moderate activity | 0.2349597 |
| No activity | 0.3410706 |

| CalsConsMon | Probability |
|---|---|
| no | 0.9545239 |
| yes | 0.0454761 |

| Nobesity | Probability |
|---|---|
| Insufficient_Weight | 0.1288489 |
| Normal_Weight | 0.1359545 |
| Obesity_Type_I | 0.1662719 |
| Obesity_Type_II | 0.1406916 |
| Obesity_Type_III | 0.1534818 |
| Overweight_Level_I | 0.1373757 |
| Overweight_Level_II | 0.1373757 |

| ConsAlc | Probability |
|---|---|
| Always | 0.0004737 |
| Frequently | 0.0331596 |
| no | 0.3027001 |
| Sometimes | 0.6636665 |
What is the joint probability of being male, frequently consuming high caloric food, and being classified as Obesity Type II?
0.0626056
This is the probability that an individual meets all three conditions at once: being male, frequently consuming high-caloric food, and being classified as Obesity Type II. At about 6.3%, roughly one in sixteen individuals in the population falls into this segment.
For marketing purposes, it suggests potential target segments that may be interested in products or services related to managing obesity, such as weight loss programs, dietary supplements, or fitness equipment.
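A joint probability like the one above can be cross-checked against the raw data by counting the share of rows that satisfy all conditions. The sketch below uses a hypothetical helper `joint_probability()` and a small made-up data frame; it computes an empirical frequency, which need not coincide with the value inferred from the network.

```r
# Hypothetical helper: share of rows that satisfy all conditions at once.
joint_probability <- function(...) {
  conditions <- list(...)
  mean(Reduce(`&`, conditions))
}

# Toy stand-in for cleaned_data (illustrative values only).
toy <- data.frame(
  Gender = c("Male", "Male", "Female", "Male", "Female", "Male", "Female", "Male"),
  FreqConsHighCalFood = c("yes", "yes", "yes", "no", "yes", "yes", "no", "yes"),
  Nobesity = c("Obesity_Type_II", "Normal_Weight", "Obesity_Type_II",
               "Obesity_Type_II", "Normal_Weight", "Obesity_Type_II",
               "Overweight_Level_I", "Obesity_Type_I")
)

p_joint <- joint_probability(
  toy$Gender == "Male",
  toy$FreqConsHighCalFood == "yes",
  toy$Nobesity == "Obesity_Type_II"
)
print(p_joint)
```

On the real data the same call would take columns of `cleaned_data`; a large gap between the empirical share and the network's answer would point to independence assumptions in the network structure.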
What is the probability distribution of obesity categories among individuals with a family history of overweight?
The probabilities give the distribution of obesity categories among individuals with a family history of overweight: each one is the likelihood of falling into a specific category given that history. The value reported for Normal Weight is 0.1359517, almost identical to its marginal share of 0.1359545. Seven probabilities that sum to 1 must include at least one value of 1/7 ≈ 0.143 or more, so 0.1359517 cannot be the largest entry of this distribution, and Normal Weight should not be read as the most common category on this evidence.
Understanding the distribution of obesity categories within this group allows marketers to prioritize their efforts. For example, if Normal Weight individuals turned out to be the largest group, marketing efforts could focus on preventive health measures or products for maintaining a healthy weight.
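An empirical version of this conditional distribution can be read off a two-way table by normalizing each row. The sketch below uses made-up rows in place of `cleaned_data`; only `table()`, `prop.table()` and `which.max()` from base R are involved.

```r
# Toy stand-in for cleaned_data (illustrative values only).
toy <- data.frame(
  ObesityFam = c("yes", "yes", "yes", "yes", "no", "no", "yes", "no"),
  Nobesity = c("Obesity_Type_I", "Obesity_Type_I", "Normal_Weight",
               "Overweight_Level_I", "Normal_Weight", "Normal_Weight",
               "Obesity_Type_I", "Insufficient_Weight")
)

# Rows are ObesityFam, columns are Nobesity; margin = 1 normalises each row,
# giving P(Nobesity | ObesityFam).
cond_dist <- prop.table(table(toy$ObesityFam, toy$Nobesity), margin = 1)
print(round(cond_dist["yes", ], 3))

# The most probable category given a family history of overweight.
most_likely <- names(which.max(cond_dist["yes", ]))
print(most_likely)
```

Checking that the selected row sums to 1 is a quick way to confirm that the values are conditional probabilities and not joint ones.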
Given that an individual is female and frequently consumes alcohol, what is the probability that she falls into the Overweight or Obesity categories?
0.7351970
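This is the probability that a female who frequently consumes alcohol falls into an Overweight or Obesity category. At about 73.5% it is high in absolute terms, but it is practically equal to the share of overweight and obese individuals in the whole sample (the five Overweight and Obesity marginals reported earlier sum to 0.7351967). The result therefore does not show that females who frequently consume alcohol are more likely than others to be classified as Overweight or Obese.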
Marketers can use this insight to tailor messaging or develop products targeting this specific segment, such as low-calorie alcoholic beverages or fitness programs aimed at weight management.
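Whether 0.7351970 is high for this segment depends on the baseline share of overweight and obese individuals in the whole sample. A short sketch of that comparison, using the marginal Nobesity probabilities reported earlier (the variable names are illustrative):

```r
# Marginal Nobesity probabilities copied from the table reported earlier.
p_nobesity <- c(
  Insufficient_Weight = 0.1288489,
  Normal_Weight = 0.1359545,
  Obesity_Type_I = 0.1662719,
  Obesity_Type_II = 0.1406916,
  Obesity_Type_III = 0.1534818,
  Overweight_Level_I = 0.1373757,
  Overweight_Level_II = 0.1373757
)

# Baseline: probability of any Overweight or Obesity category.
at_risk <- grepl("^(Overweight|Obesity)", names(p_nobesity))
p_baseline <- sum(p_nobesity[at_risk])

# Reported probability for females who frequently consume alcohol.
p_segment <- 0.7351970

print(round(p_baseline, 7))
# A ratio near 1 means the segment is no more at risk than the baseline.
print(round(p_segment / p_baseline, 4))
```

With these figures the baseline is about 0.7352, essentially the same as the segment value, so the segment does not stand out from the sample as a whole.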
What is the probability of using public transportation given that an individual has a high frequency of consuming vegetables and is classified under Normal Weight?
0.6006884
This is the probability that an individual uses public transportation given a high frequency of vegetable consumption and a Normal Weight classification. At about 60%, public transportation remains the most likely mode of transport for this group, although the figure is below the 74.8% share of public transportation in the whole sample.
Marketers can target this segment with advertisements for eco-friendly products or services, such as sustainable transportation options or environmentally conscious brands.
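Conditioning on two variables at once shrinks the number of supporting observations, so it is worth reporting the segment size next to the probability. The sketch below does this empirically with a hypothetical helper `conditional_probability()` and made-up rows; it is not the inference procedure used by the network.

```r
# Hypothetical helper: P(event | condition) together with the number of rows
# that satisfy the condition, so thin segments are easy to spot.
conditional_probability <- function(event, condition) {
  n_condition <- sum(condition)
  if (n_condition == 0) {
    return(list(probability = NA_real_, n = 0L))
  }
  list(probability = mean(event[condition]), n = n_condition)
}

# Toy stand-in for cleaned_data (illustrative values only).
toy <- data.frame(
  Trans = c("Public_Transportation", "Automobile", "Public_Transportation",
            "Walking", "Public_Transportation", "Automobile"),
  FreqConsVegs = c("High consumption", "High consumption", "Low consumption",
                   "High consumption", "High consumption", "Moderate consumption"),
  Nobesity = c("Normal_Weight", "Normal_Weight", "Normal_Weight",
               "Normal_Weight", "Obesity_Type_I", "Normal_Weight")
)

result <- conditional_probability(
  event = toy$Trans == "Public_Transportation",
  condition = toy$FreqConsVegs == "High consumption" & toy$Nobesity == "Normal_Weight"
)
print(result)
```

Returning `n` alongside the probability makes it clear when an estimate rests on only a handful of individuals.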
Given that an individual monitors calorie consumption and has a high physical activity frequency, what is the probability that they are not overweight?
0.0869251
This is the probability that an individual who monitors calorie consumption and has a high physical activity frequency is not overweight. At about 8.7% it is low: taking “not overweight” to mean Insufficient Weight or Normal Weight, the corresponding share in the whole sample is about 26.5%. The result implies that even with these healthy lifestyle habits the network assigns a high probability to being overweight.
Marketers can use this insight to position their products or services as complementary to a healthy lifestyle rather than solely focusing on weight loss. For example, promoting fitness gear or nutrition supplements as aids in maintaining overall health and well-being.
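The complementary probability makes the size of the remaining risk explicit. Assuming that “not overweight” covers Insufficient_Weight and Normal_Weight, so that its complement covers the five Overweight and Obesity levels, the complement rule gives:

```latex
P(\text{overweight or obese} \mid \text{CalsConsMon} = \text{yes},\ \text{PhysActFreq} = \text{High activity})
  = 1 - 0.0869251
  = 0.9130749
```

For comparison, the marginal tables give a whole-sample share of 0.1288489 + 0.1359545 = 0.2648034 for the two categories that are not overweight.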
Segment Identification and Prioritization: Understanding the likelihood of specific consumer characteristics or behaviors allows marketers to identify potential target segments and prioritize their marketing efforts accordingly. By focusing on segments with higher probabilities of certain traits or behaviors, marketers can allocate resources more effectively to reach the most receptive audience.
Segment Targeting and Tailored Messaging: The probabilities provide insights into the preferences and behaviors of different consumer segments. Marketers can use this information to tailor their messaging and develop targeted campaigns that resonate with specific audience segments. By addressing the unique needs and preferences of each segment, marketers can increase the effectiveness of their marketing efforts and enhance consumer engagement.
Behavioral Targeting and Product Positioning: Understanding the relationship between consumer behaviors and outcomes enables marketers to target specific segments more effectively and position their products or services accordingly. By aligning marketing strategies with consumer behaviors and preferences, marketers can create more compelling value propositions and differentiate their offerings in the market.
Overall, these probabilities offer valuable insights that can inform marketing strategies and decision-making processes. By leveraging this data effectively, marketers can optimize their efforts, maximize their return on investment, and better meet the needs and preferences of their target audience.